CVPN-2346 Implement GSO offload on lightway-server #413
Code coverage summary for b83189e: ✅ Region coverage 66% passes
// Expose the full slab to `recv_gso` as `&mut [u8]`.
// SAFETY: every byte of the slab was zero-initialized at
// construction; subsequent iters only ever shrunk `len` or
// overwrote bytes. We never hand out uninitialized memory.
This does not sound safe: since `pkt` is mutable, we could also create a new `BytesMut` and replace it.
If you then call `reserve`, the new capacity might be uninitialized.
I think what you want is https://docs.rs/bytes/latest/bytes/struct.BytesMut.html#method.spare_capacity_mut, which returns the spare capacity as a slice that you can pass to `recv_gso`.
Because the signature is `spare_capacity_mut(&mut self) -> &mut [MaybeUninit<u8>]`, I think we ultimately still need an unsafe cast on the `&mut [MaybeUninit<u8>]`, since `tun_rs::AsyncDevice::recv()` only takes `&mut [u8]` in the end. I will think about it.
Interestingly, there's also `std::io::Read::read_buf` in nightly std, which reads into a possibly uninitialized buffer (and which tun-rs could use), but it has never been stabilized.
Edit: I changed the recv signatures to accept `MaybeUninit` and do one unsafe cast at the end. I am hoping to drop it once tun-rs accepts `MaybeUninit`. Please take a look at c064409 👀
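For readers following this thread, here is a minimal sketch of the pattern being discussed. It uses std's `Vec<u8>` as a stand-in for `BytesMut` (both offer `spare_capacity_mut()` returning `&mut [MaybeUninit<u8>]`). `fake_recv` and `recv_into_spare` are hypothetical names for illustration only, not the tun-rs or lightway API.

```rust
use std::mem::MaybeUninit;

/// Hypothetical reader that, like the syscall wrapper discussed above,
/// only accepts `&mut [u8]`. It writes into the buffer and never reads it.
fn fake_recv(buf: &mut [u8]) -> usize {
    let data = b"packet";
    let n = data.len().min(buf.len());
    buf[..n].copy_from_slice(&data[..n]);
    n
}

/// Reads into the spare capacity of `buf` without zero-initializing it first.
fn recv_into_spare(buf: &mut Vec<u8>) -> usize {
    let spare: &mut [MaybeUninit<u8>] = buf.spare_capacity_mut();
    // This cast is the contentious step: it relies on the callee treating the
    // slice as write-only. That is an assumption that must hold for whatever
    // real reader replaces `fake_recv`.
    let raw: &mut [u8] =
        unsafe { std::slice::from_raw_parts_mut(spare.as_mut_ptr().cast::<u8>(), spare.len()) };
    let n = fake_recv(raw);
    // SAFETY: the first `n` bytes of the spare capacity were just written.
    unsafe { buf.set_len(buf.len() + n) };
    n
}
```

Keeping the cast in one helper next to the reader means there is a single place to delete once the reader accepts `MaybeUninit` directly.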
Add the `gso` module to lightway-core with VirtioNetHdr definition, checksum helpers, and segment build/count functions for splitting GSO superpackets into individual segments with correct per-segment header fixups (IP ID, TCP seq, checksums). Also add tun-rs workspace dependency to lightway-core and lightway-server Cargo.toml.
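As a rough illustration of the kind of helpers this commit describes, the sketch below shows an RFC 1071 Internet checksum, a segment count, and a per-segment TCP sequence number. The function names and signatures are assumptions made for this example; they are not taken from the `gso` module itself.

```rust
/// RFC 1071 Internet checksum: one's-complement of the one's-complement sum
/// of the data taken as big-endian 16-bit words (odd trailing byte padded with zero).
fn internet_checksum(data: &[u8]) -> u16 {
    let mut sum: u64 = 0;
    let mut words = data.chunks_exact(2);
    for w in &mut words {
        sum += u64::from(u16::from_be_bytes([w[0], w[1]]));
    }
    if let [last] = words.remainder() {
        sum += u64::from(*last) << 8;
    }
    // Fold the carries back into the low 16 bits.
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// Number of segments a payload of `payload_len` bytes splits into when each
/// segment carries at most `gso_size` bytes. Returns `None` for `gso_size == 0`.
fn segment_count(payload_len: usize, gso_size: usize) -> Option<usize> {
    if gso_size == 0 {
        return None;
    }
    Some((payload_len + gso_size - 1) / gso_size)
}

/// TCP sequence number of segment `index`, assuming every earlier segment
/// carried exactly `gso_size` payload bytes. Sequence numbers wrap modulo 2^32.
fn segment_tcp_seq(base_seq: u32, index: u32, gso_size: u32) -> u32 {
    base_seq.wrapping_add(index.wrapping_mul(gso_size))
}
```

For example, a 3000-byte payload with a `gso_size` of 1448 yields three segments, the last carrying 104 bytes.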
Add the `send_gso` method to the OutsideIOSendCallback trait for sending concatenated wire packets via kernel GSO (UDP_SEGMENT). Include todo!() stub implementations in client TCP/UDP, server TCP, and test harnesses to satisfy the trait contract.
Add gso_buf/gso_size fields to TlsIOAdapter so the wolfssl send() callback can buffer raw encrypted segments during GSO processing. Add udp_send_gso to wrap buffered segments with wire headers and send as one sendmsg via the vectored send_gso callback. The implementation uses a zero-copy fast path when no outside plugins are configured: scatter-gather via iovec with a shared header buffer and borrowed slices of the encrypted segment buffer. The plugin path builds each segment as its own BytesMut and enforces the uniform-stride requirement of UDP_SEGMENT.
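The uniform-stride requirement mentioned above can be sketched as a small check: with `UDP_SEGMENT` the kernel cuts the payload into `gso_size`-byte datagrams, so every segment except the last must have the same length and the last may only be shorter. `uniform_stride` is a hypothetical helper name used for this example, not the function in the PR.

```rust
/// Returns the stride to use as `gso_size` if the segment lengths fit the
/// UDP_SEGMENT layout: all segments but the last are equal in length, and the
/// last one is non-empty and no longer than the others. Returns `None` otherwise.
fn uniform_stride(lens: &[usize]) -> Option<usize> {
    let (&first, rest) = lens.split_first()?;
    if first == 0 {
        return None;
    }
    let (&last, middle) = match rest.split_last() {
        Some(parts) => parts,
        // A single segment is trivially uniform.
        None => return Some(first),
    };
    let middle_ok = middle.iter().all(|&len| len == first);
    if middle_ok && last > 0 && last <= first {
        Some(first)
    } else {
        None
    }
}
```

A caller on the plugin path could fall back to per-segment sends whenever this returns `None`, for instance when a plugin changes the size of only some segments.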
Add inside_data_received_gso and send_to_outside_gso methods to Connection. These process a GSO superpacket as a single packet through plugins/encoder, then split into per-segment encrypted frames and collect into a wire buffer for batch send via UDP_SEGMENT.
Add offload config field to TunConfig to enable IFF_VNET_HDR on TUN devices. Add recv_gso for raw reads that include the virtio_net_hdr prefix, and prepend a zeroed virtio header on try_send when offload is enabled.
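To make the framing concrete, here is a sketch of the 10-byte legacy `virtio_net_hdr` that prefixes each packet when `IFF_VNET_HDR` is enabled, plus the zeroed-header prepend used on the write side. The struct and helper names below are illustrative and are not the definitions from this PR. The sketch also assumes the header fields are little-endian, which holds on little-endian hosts; TUN uses native byte order unless configured otherwise.

```rust
const VIRTIO_NET_HDR_LEN: usize = 10;
const VIRTIO_NET_HDR_GSO_NONE: u8 = 0;

/// Illustrative view of the legacy virtio_net_hdr fields.
#[derive(Debug, PartialEq, Eq)]
struct VnetHdr {
    flags: u8,
    gso_type: u8,
    hdr_len: u16,
    gso_size: u16,
    csum_start: u16,
    csum_offset: u16,
}

impl VnetHdr {
    /// Splits a raw TUN read into the parsed header and the packet that follows it.
    fn parse(buf: &[u8]) -> Option<(Self, &[u8])> {
        if buf.len() < VIRTIO_NET_HDR_LEN {
            return None;
        }
        let (h, packet) = buf.split_at(VIRTIO_NET_HDR_LEN);
        // Assumption: little-endian field encoding (see lead-in).
        let u16_at = |i: usize| u16::from_le_bytes([h[i], h[i + 1]]);
        let hdr = VnetHdr {
            flags: h[0],
            gso_type: h[1],
            hdr_len: u16_at(2),
            gso_size: u16_at(4),
            csum_start: u16_at(6),
            csum_offset: u16_at(8),
        };
        Some((hdr, packet))
    }

    /// True when the packet is a superpacket that still needs segmentation.
    fn is_gso(&self) -> bool {
        self.gso_type != VIRTIO_NET_HDR_GSO_NONE
    }
}

/// Prepends an all-zero header (no GSO, no checksum offload) to an outgoing packet.
fn prepend_zeroed_hdr(packet: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; VIRTIO_NET_HDR_LEN];
    out.extend_from_slice(packet);
    out
}
```

A read loop can branch on `is_gso()` to choose between the superpacket path and the ordinary single-packet path.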
Extend send_to_socket to accept an optional gso_size parameter and build UDP_SEGMENT cmsg for kernel-level segmentation. Implement the real send_gso on UdpSocket using this path.
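For reference, the control message that requests kernel segmentation is small: a `cmsghdr` with level `SOL_UDP` and type `UDP_SEGMENT`, carrying the segment size as a `u16`. The sketch below builds those bytes by hand using only std, assuming the Linux `cmsghdr` layout (a `size_t` length followed by two `int`s). `udp_segment_cmsg` is a hypothetical helper, not the PR's code; real code would normally rely on the `CMSG_*` helpers from a libc binding instead.

```rust
use std::mem::size_of;

const SOL_UDP: i32 = 17; // Linux value, same as IPPROTO_UDP
const UDP_SEGMENT: i32 = 103; // from <linux/udp.h>

/// Rounds `n` up to the cmsg alignment (size of `usize`, as CMSG_ALIGN does on Linux).
fn cmsg_align(n: usize) -> usize {
    let a = size_of::<usize>();
    (n + a - 1) & !(a - 1)
}

/// Builds the ancillary-data bytes for one UDP_SEGMENT control message.
/// Layout assumption: { cmsg_len: usize, cmsg_level: i32, cmsg_type: i32 }, then data.
fn udp_segment_cmsg(gso_size: u16) -> Vec<u8> {
    let len_sz = size_of::<usize>();
    let hdr = cmsg_align(len_sz + 2 * size_of::<i32>());
    // Equivalent of CMSG_LEN(2): header plus the unpadded payload.
    let cmsg_len = hdr + size_of::<u16>();
    // Equivalent of CMSG_SPACE(2): header plus the padded payload.
    let mut buf = vec![0u8; hdr + cmsg_align(size_of::<u16>())];
    buf[..len_sz].copy_from_slice(&cmsg_len.to_ne_bytes());
    buf[len_sz..len_sz + 4].copy_from_slice(&SOL_UDP.to_ne_bytes());
    buf[len_sz + 4..len_sz + 8].copy_from_slice(&UDP_SEGMENT.to_ne_bytes());
    buf[hdr..hdr + 2].copy_from_slice(&gso_size.to_ne_bytes());
    buf
}
```

When no `gso_size` is supplied, the control buffer is simply left out and the send behaves as an ordinary single-datagram `sendmsg`.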
Add enable_tun_offload config option and wire it through ServerConfig to main. Extract the default inside IO loop into its own function and add inside_io_loop_gso that reads virtio-framed superpackets from TUN, dispatches GSO vs single-packet paths, and sets gso_max_size on the TUN device.
Lets the GSO recv loop use BytesMut::spare_capacity_mut() directly, dropping the one-time 65 KB zero-init and one of the two call-site unsafe blocks. The cast back to &mut [u8] now lives only at the syscall boundary in TunDirect::recv_gso, with a comment pointing at the tun-rs upstream gap to track for cleanup.
Description
Implement GSO on the server side for DTLS and Expresslane, specifically for bulk server->client traffic. This consistently halves the total number of syscalls used during bulk transfers, and also improves aggregate server throughput by 2x when multiple clients are transferring.
When `--enable-tun-offload` is set, the server reads TSO superpackets from the TUN with `IFF_VNET_HDR`, segments them in userspace, and emits each superpacket as a single `sendmsg(UDP_SEGMENT)` instead of N per-segment syscalls. On a single-flow iperf3 reverse test the kernel UDP send path collapses almost completely: `udp_sendmsg` 0.71% → ~0.05%, `sock_alloc_send_pskb` 2.61% → ~0.13%, `mlx5e_xmit` 1.88% → ~0%.

Trade-off: kernel work is replaced with userspace work (per-segment IP/TCP/UDP checksum recomputation, segment assembly). The kernel-side wins are clear and measurable; userspace cost is now the dominant factor.

Pacing: each `sendmsg(UDP_SEGMENT)` produces a NIC burst of up to N segments. This can exceed the receiver's socket buffer depth and increase tail drops at peak rates. We will need to revisit TX pacing for congested links.

Future work will focus on compatibility with TUN backends like io_uring, GRO on the server side, and full GSO/GRO on the client side, where single-flow workloads should see the biggest visible speedup (not part of this PR).
Motivation and Context
See ticket CVPN-2346.
How Has This Been Tested?
Types of changes
Checklist: